Systems and methods for synchronizing multiple processing engines of a microprocessor

ABSTRACT

Systems and methods for synchronizing multiple processing engines of a microprocessor. In a microprocessor engine employing processor extension logic, DMA engines are used to permit the processor extension logic to move data into and out of local memory independent of the main instruction pipeline. Synchronization between the extended instruction pipeline and DMA engines is performed to maximize simultaneous operation of these elements. The DMA engines include a data-in engine and a data-out engine, each adapted to buffer at least one instruction in a queue. If the queue of a DMA engine is full and a new instruction is trying to enter that queue, the DMA engine will cause the extended pipeline to pause execution until the current DMA operation is complete. This prevents data overwrites while maximizing simultaneous operation.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 60/721,108 titled “SIMD Architecture and Associated Systems and Methods,” filed Sep. 28, 2005, the disclosure of which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

The invention relates generally to embedded microprocessor architecture and more specifically to systems and methods for synchronizing the operation of multiple processing engines in a microprocessor-based system.

BACKGROUND OF THE INVENTION

Processor extension logic is utilized to extend a microprocessor's capability.

Typically, this logic operates in parallel with, and is accessible by, the main processor pipeline. It is often used to perform specific, repetitive, computationally intensive functions, thereby freeing up the main processor pipeline.

A design issue that must be addressed in microprocessor architectures, and in microprocessor-based systems in general, that employ processor extension logic, such as an extended instruction pipeline that is distinct from the main instruction pipeline, is synchronization and control. It is difficult to balance the competing interests of simplifying implementation and debugging while maximizing parallelism.

Thus, there exists a need for a parallel pipeline architecture that can fully exploit the advantages of parallelism without suffering from the design complexity of loosely or completely decoupled pipelines.

SUMMARY OF THE INVENTION

At least one embodiment of the invention may provide a method for synchronization of multiple processing engines in an extended processor core. The method according to this embodiment may comprise placing direct memory access (DMA) functionality in a single instruction multiple data (SIMD) pipeline, where the DMA functionality comprises a data-in engine and a data-out engine, and each DMA engine is allowed to buffer at least one instruction issued to it in a queue without stopping the SIMD pipeline. The method may also comprise, when the DMA engine queue is full and a new DMA instruction is trying to enter the queue, blocking the SIMD pipeline from executing any instructions that follow until the current DMA operation is complete, thereby allowing the DMA engine and SIMD pipeline to maximize parallel operation while still remaining synchronized.

Another embodiment of the invention provides a method for synchronizing multiple processing engines of a microprocessor. The method according to this embodiment comprises coupling an extended instruction pipeline to a main instruction pipeline, coupling direct memory access (DMA) engines to the extended instruction pipeline, buffering at least one instruction in a queue in the DMA engine without stopping the extended instruction pipeline, and blocking the extended instruction pipeline from further execution when a DMA engine queue is full and a new DMA instruction arrives at the queue until a current DMA operation is complete.

A further embodiment of the invention provides a multi-processing engine architecture for a microprocessor. The multi-processing engine architecture for a microprocessor according to this embodiment comprises a main instruction pipeline, an extended instruction pipeline coupled to the main instruction pipeline via an instruction queue, and direct memory access (DMA) engines coupled to the extended instruction pipeline, the DMA access engines comprising a data-in engine and a data-out engine, wherein each of the data-in and data-out engines comprises an instruction queue adapted to buffer at least one instruction.

An additional embodiment of the invention provides, in a microprocessor having a main instruction pipeline and processor extension logic comprising an extended instruction pipeline that is coupled to the main instruction pipeline via an instruction queue, wherein the extended instruction pipeline is adapted to be selectively decoupled from the main instruction pipeline to perform autonomous operation, and where the extended instruction pipeline is further coupled to DMA engines for moving data into and moving data out of a local memory, a method for maximizing simultaneous operation of the extended instruction pipeline and the DMA engines. The method according to this embodiment comprises executing an instruction from the extended instruction pipeline requiring the DMA engine, buffering the instruction if sufficient queue space is available in the DMA engine, and preventing the extended instruction pipeline from further execution if insufficient queue space is available until a current DMA operation is complete, freeing up a space in the queue to accept a blocked DMA instruction on the instruction pipeline, thereafter resuming execution of the extended processor pipeline.

These and other embodiments and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to facilitate a fuller understanding of the present disclosure, reference is now made to the accompanying drawings, in which like elements are referenced with like numerals. These drawings should not be construed as limiting the present disclosure, but are intended to be exemplary only.

FIG. 1 is a functional block diagram illustrating a microprocessor-based system including a main processor core and a SIMD media accelerator according to at least one embodiment of the invention;

FIG. 2 is an instruction sequence flow diagram and corresponding event time line illustrating a method for synchronizing processing between DMA tasks and SIMD tasks according to at least one embodiment of the invention; and

FIG. 3 is a flow chart detailing steps of an exemplary method for synchronizing multiple processing engines in a microprocessor according to various embodiments of the invention.

DETAILED DESCRIPTION

The following description is intended to convey a thorough understanding of the embodiments described by providing a number of specific embodiments and details involving microprocessor architecture and systems and methods for synchronizing multiple processing engines in a microprocessor-based system. It should be appreciated, however, that the present invention is not limited to these specific embodiments and details, which are exemplary only. It is further understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.

Commonly assigned U.S. patent application Ser. No. ______ titled “System and Method for Selectively Decoupling a Parallel Extended Processor Pipeline,” filed concurrently with this application is hereby incorporated by reference in its entirety into the disclosure of this application.

Referring now to FIG. 1, a functional block diagram illustrating a microprocessor-based system 5 including a main processor core 10 and a SIMD media accelerator 50 according to at least one embodiment of the invention is provided. The diagram illustrates a microprocessor 5 comprising a standard single instruction single data (SISD) processor core 10 having a multistage instruction pipeline 12 and a SIMD media engine 50. In various embodiments, the processor core 10 may be a processor core such as the ARC 700 embedded processor core available from ARC International Limited of Elstree, United Kingdom, and as described in provisional patent application No. 60/572,238 filed May 19, 2004 entitled “Microprocessor Architecture,” which is hereby incorporated by reference in its entirety. Alternatively, in various embodiments, the processor core may be a different processor core.

In various embodiments, a single instruction issued by the processor pipeline 12 may cause up to sixteen 16-bit elements to be operated on in parallel through the use of the 128-bit data path 55 in the media engine 50. In various embodiments, the SIMD engine 50 utilizes closely coupled memory units. In various embodiments, the SIMD data memory 52 (SDM) is a 128-bit wide data memory that provides low latency access to perform loads to and stores from the 128-bit vector register file 51. The SDM contents are transferable via a DMA unit 54 thereby freeing up the processor core 10 and the SIMD core 50. In various embodiments, the DMA unit 54 comprises a DMA in engine 61 and a DMA out engine 62. In various embodiments, both the DMA in engine 61 and DMA out engine 62 may comprise instruction queues (labeled Q in the Figure) for buffering one or more instructions. In various embodiments, a SIMD code memory 56 (SCM) allows the SIMD unit to fetch instructions from a localized code memory, allowing the SIMD pipeline to dynamically decouple from the processor core 10 resulting in truly parallel operation between the processor core and SIMD media engine as discussed in commonly assigned U.S. patent application Ser. No. ______, titled, “Systems and Methods for Recording Instruction Sequences in a Microprocessor Having a Dynamically Decoupleable Extended Instruction Pipeline,” filed concurrently herewith, the disclosure of which is hereby incorporated by reference in its entirety.

Therefore, the microprocessor architecture according to various embodiments of the invention may permit the processor to operate in both closely coupled and decoupled modes of operation. In the closely coupled mode of operation, the SIMD program code fetch and program stream supply are handled exclusively by the processor core 10. In the decoupled mode of operation, the SIMD pipeline 53 executes code from a local memory 56 independent of the processor core 10. The processor core 10 may control the SIMD pipeline 53 to execute media tasks such as audio processing, entropy encoding/decoding, discrete cosine transforms (DCTs) and inverse DCTs, motion compensation and de-block filtering.

With continued reference to the microprocessor architecture in FIG. 1, the main processor pipeline 12 has been extended with a high performance SIMD engine 50 and two direct memory access (DMA) engines 61 and 62, one for moving data into a local memory, the SIMD data memory (SDM), and one for moving data out of local memory. The SIMD engine 50 and DMA engines 61, 62 all execute instructions that are fetched and issued from the main processor pipeline 12. To achieve high performance, these individual engines need to be able to operate in parallel, and hence, as discussed above, instruction queues (Q) are placed between the main processor core 10 and the SIMD engine 50, and between the SIMD engine 50 and the DMA engines 61, 62, so that they can all operate out of step with each other. In addition, in various embodiments, a local SIMD code memory (SCM) is introduced so that macros can be called and executed from this memory. This allows the main processor core, the SIMD engine and the DMA engines to execute out of step with each other.

As discussed above, operating the main pipeline, extended pipeline and DMA engines in parallel introduces the problem of synchronization. For example, a SIMD code segment may have to wait for a DMA operation, kicked off by the instruction just preceding it, to finish transferring data into the SDM. Conversely, the DMA engine cannot start transferring data out of the SDM until the previously issued SIMD code has been executed. This type of synchronization is normally performed by using software to probe status bits toggled by these engines, or by using interrupts and their associated service routines to kick off the dependent processes. Both of these solutions require large overheads in terms of cycles as well as coding effort to achieve the synchronization desired.

In order to reduce these overheads, in various embodiments of the invention, the DMA engines 61, 62 are placed in the SIMD pipeline 53 itself, but each DMA engine is allowed to buffer one or more instructions issued to it in a queue without stopping the SIMD pipeline execution. The SIMD pipeline 53 is blocked from executing further instructions only when the DMA engine instruction queue is full and another DMA instruction arrives at that DMA engine. This allows the software to be organized so that the blocking itself makes a SIMD code segment wait for a DMA operation to complete, or vice versa, as long as a double or more buffering approach is used, that is, two or more buffers are used to allow overlapping of data transfer and data computation.
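By way of non-limiting illustration, the blocking behavior described above may be modeled in software as follows. This is a minimal sketch and not the disclosed hardware: the class name DmaEngine, its method names, and the choice to treat the operation in progress as occupying a queue slot until it completes are assumptions made for the example.

```python
from collections import deque


class DmaEngine:
    """Toy software model of one DMA engine and its instruction queue.

    Assumption of this sketch: the operation in progress keeps its queue
    slot until it completes, so a one-deep queue holds exactly one
    outstanding DMA instruction.
    """

    def __init__(self, depth=1):
        self.depth = depth
        self.queue = deque()  # head of the queue is the operation in progress

    def try_issue(self, instr):
        """Return True if buffered; False means the SIMD pipeline must stall."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append(instr)
        return True

    def complete_current(self):
        """Finish the operation in progress, freeing one queue slot."""
        return self.queue.popleft()


dma_in = DmaEngine(depth=1)
first = dma_in.try_issue("DI1")    # buffered; the pipeline keeps running
second = dma_in.try_issue("DI2")   # queue full; the pipeline would stall here
finished = dma_in.complete_current()
retry = dma_in.try_issue("DI2")    # a slot is free again, so DI2 is buffered
print(first, second, finished, retry)
```

In this model the pipeline never polls a status bit; the only synchronization point is the attempt to issue a DMA instruction to a full queue.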

With continued reference to the processor architecture of FIG. 1, there are two DMA engines 61, 62, one for moving data into a local memory and one for moving data out of local memory. Each DMA channel is allowed to buffer at least one instruction in a queue. Suppose, for example, that there are two independent video pixel data blocks to be processed, and that each requires multiple blocks of pixel data to be moved into local memory and to be processed, before moving the results out of local memory.

FIG. 2 shows an instruction sequence flow diagram 100 and corresponding event time line 110 illustrating a method for synchronizing processing between DMA tasks and SIMD tasks, with only a one-deep instruction queue in each DMA engine, according to at least one embodiment of the invention. Looking at the instruction sequence flow diagram 100, the DI2 DMA operation is blocked if the buffered DI1 DMA operation is not completed, causing the DI2 DMA instruction to be blocked from entering the DMA instruction queue, which in turn results in the S1 SIMD operation being blocked. Since the S1 operation depends on data from the DI1 operation, the blocking action prevents the S1 SIMD instruction sequence from proceeding until the DI1 operation is completed. The DI3 DMA operation is executed only after S1 is completed. This eliminates any chance of DI3 overwriting the same data region targeted by the DI1 operation before the data is used by the computation S1. By the time DI3 has completed, the DI2 operation would have completed, allowing S2 to start. If, however, the DI2 operation is not completed, the DI3 operation will be blocked, preventing S2 from starting. Likewise, the DO operation is only executed when S4 has completed. It should be appreciated that in the time line 110 of FIG. 2, DI2 and S1, DI3 and S2, and DI4 and S3 are shown as starting at the same time respectively. In actual operation, S1 will start one clock cycle after DI2, S2 will start one clock cycle after DI3, and S3 will start one clock cycle after DI4. The time line is intended to demonstrate that S1 cannot start before DI1 is complete, S2 cannot start before DI2 is complete, S3 cannot start before DI3 is complete, and S4 cannot start before DI4 is complete.
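For illustration only, the ordering constraints described for FIG. 2 can be reproduced with a small event simulation. The sketch below is an assumption-laden model rather than the disclosed implementation: the function name simulate, the cycle counts, and the instruction order DI1, DI2, S1, DI3, S2, DI4, S3 (inferred from the description above) are all hypothetical, and the one-cycle issue offset is ignored, as in the time line 110.

```python
def simulate(program, dma_cycles=10, simd_cycles=6, depth=1):
    """Return {name: (start, end)} for an in-order SIMD pipeline.

    Names starting with "DI" or "DO" go to the data-in or data-out engine;
    any other name is treated as a SIMD code segment. Cycle counts are
    hypothetical and the one-cycle issue offset is ignored.
    """
    t = 0                              # time at which the pipeline issues next
    ends = {"DI": [], "DO": []}        # completion times per DMA engine
    times = {}
    for name in program:
        engine = name[:2]
        if engine not in ends:         # SIMD segment: occupies the pipeline
            times[name] = (t, t + simd_cycles)
            t += simd_cycles
            continue
        busy = [e for e in ends[engine] if e > t]   # slots still occupied
        if len(busy) >= depth:         # queue full: stall until a slot frees
            t = busy[len(busy) - depth]
        last_end = ends[engine][-1] if ends[engine] else 0
        start = max(t, last_end)       # each engine runs one operation at a time
        ends[engine].append(start + dma_cycles)
        times[name] = (start, start + dma_cycles)
    return times


timeline = simulate(["DI1", "DI2", "S1", "DI3", "S2", "DI4", "S3"])
for name, (start, end) in timeline.items():
    print(f"{name}: {start}..{end}")
```

Under these assumed cycle counts, each SIMD segment in the model starts no earlier than the completion of its matching data-in operation, and DI3 starts only after S1 has finished, which is the overwrite protection described above.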

This approach avoids the need for the main processor core to intervene continuously in order to achieve synchronization between the DMA unit and the SIMD pipeline. However, the processor core 10 does need to ensure that the instruction sequence sent uses this functionality to achieve the best performance by parallelizing SIMD and DMA operations. Thus, an advantage of this approach is that it facilitates the synchronization of SIMD and DMA operations in a multi-engine video processing core with minimal interaction with the main control processor core. This approach can be extended by increasing the depth of the DMA non-blocking instruction queue so as to allow more DMA instructions to be buffered in the DMA channels, allowing double, triple or more buffering.
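As a rough illustration of the effect of queue depth, the sketch below counts how long a pipeline would be stalled while issuing a burst of back-to-back DMA instructions to a single engine. The function name pipeline_stall_cycles, the burst length, and the cycle count per DMA operation are hypothetical, and the model assumes, as a simplification, that an operation holds its queue slot until it completes.

```python
def pipeline_stall_cycles(depth, n_dma, dma_cycles=10):
    """Cycles the pipeline spends stalled while issuing n_dma back-to-back
    instructions to one DMA engine with a queue of the given depth."""
    t = 0            # pipeline time; advances only while stalled in this model
    ends = []        # completion times of accepted DMA operations
    for _ in range(n_dma):
        busy = [e for e in ends if e > t]      # operations still holding a slot
        if len(busy) >= depth:                 # queue full: wait for a free slot
            t = busy[len(busy) - depth]
        start = max(t, ends[-1] if ends else 0)  # the engine is serial
        ends.append(start + dma_cycles)
    return t


stalls = {depth: pipeline_stall_cycles(depth, n_dma=4) for depth in (1, 2, 4)}
print(stalls)
```

In this model a deeper queue removes stall time, but acceptance of a DMA instruction then only implies that the instruction issued that many places earlier has completed, so the software would need correspondingly more buffers, consistent with the double, triple or more buffering noted above.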

Referring now to FIG. 3, this Figure is a flow chart of an exemplary method for synchronizing multiple processing engines in a microprocessor-based system according to at least one embodiment of the invention. FIG. 3 demonstrates a method for coding the instruction sequence to allow both the SIMD engine and DMA engines to operate simultaneously as much as possible. The method begins in step 200 and proceeds to step 205 where an instruction requiring the DMA engine is executed by the SIMD pipeline. In step 210, the SIMD pipeline accesses the required DMA engine queue. If, in step 210, the DMA engine instruction queue is already full when it is accessed, the SIMD pipeline is paused from further execution, as described in step 215. In step 220, the SIMD pipeline waits for a free space in the instruction queue of the targeted DMA engine. In the meantime, the DMA engine corresponding to the target queue performs its current DMA operation as instructed by the DMA instruction(s) already in the queue. After this operation is performed, the DMA engine instruction queue opens up a free space so that in step 225, the stalled DMA instruction can be buffered in the queue. The SIMD pipeline then resumes execution in step 230 after the DMA instruction has been buffered. Accordingly, through the various systems and methods disclosed herein, simultaneous operation of the SIMD pipeline and the DMA engines is maximized without the risk of overwrite.
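The steps of FIG. 3 may be traced in software as sketched below. The function name issue_dma and the list-of-step-numbers return value are hypothetical conveniences for the example. The path taken when the queue is not full (proceeding directly to buffering in step 225) is an assumption, since the description above details only the queue-full path.

```python
from collections import deque


def issue_dma(queue, depth, instr):
    """Trace the FIG. 3 step numbers visited while issuing one DMA instruction.

    queue holds the instructions buffered in the target DMA engine; its head
    is taken to be the operation in progress (an assumption of this sketch).
    """
    steps = [205, 210]           # instruction executed; DMA queue accessed
    if len(queue) >= depth:      # step 210 finds the queue full
        steps.append(215)        # SIMD pipeline is paused
        steps.append(220)        # pipeline waits for a free slot
        queue.popleft()          # current DMA operation completes, freeing a slot
    queue.append(instr)          # step 225: the DMA instruction is buffered
    steps += [225, 230]          # step 230: the pipeline resumes (or continues)
    return steps


queue = deque()
trace_free = issue_dma(queue, 1, "DI1")   # queue has room
trace_full = issue_dma(queue, 1, "DI2")   # queue full, so the pipeline pauses
print(trace_free)
print(trace_full)
```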

The embodiments of the present inventions are not to be limited in scope by the specific embodiments described herein. For example, although many of the embodiments disclosed herein have been described with reference to systems and methods for synchronizing multiple processing engines in a microprocessor-based system having a main instruction pipeline and an extended instruction pipeline, the principles herein are equally applicable to other aspects of microprocessor design and function. Indeed, various modifications of the embodiments of the present inventions, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such modifications are intended to fall within the scope of the following appended claims. Further, although some of the embodiments of the present invention have been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the embodiments of the present inventions can be beneficially implemented in any number of environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the embodiments of the present inventions as disclosed herein.

1. A method for synchronizing multiple processing engines of a microprocessor comprising: coupling an extended instruction pipeline to a main instruction pipeline; coupling direct memory access (DMA) engines to the extended instruction pipeline; buffering at least one instruction in the DMA engine, using a queue, without stopping the extended instruction pipeline; and blocking the extended instruction pipeline from further execution when a DMA engine instruction queue is full and a new DMA instruction arrives at the queue, until a current DMA operation is complete.
 2. The method according to claim 1, wherein coupling DMA engines to the extended instruction pipeline comprises coupling a DMA engine having a data-in channel for moving data into a local memory and a DMA engine having a data-out channel for moving data out of a local memory of the extended instruction pipeline.
 3. The method according to claim 2, wherein blocking the extended instruction pipeline comprises blocking the extended instruction pipeline from issuing subsequent instructions until a DMA operation is complete when any DMA engine instruction queue is full and a new DMA instruction is being issued to it.
 4. The method according to claim 1, further comprising restarting execution of the extended instruction pipeline when the DMA operation is complete and the new DMA operation that was trying to enter the buffer successfully leaves the instruction pipeline and enters the instruction queue.
 5. A multi-processing engine architecture for a microprocessor comprising: a main instruction pipeline; an extended instruction pipeline coupled to the main instruction pipeline via an instruction queue; and direct memory access (DMA) engines coupled to the extended instruction pipeline, the DMA access engines comprising a data-in engine and a data-out engine, wherein each of the data-in and data-out engines comprises an instruction queue adapted to buffer at least one instruction.
 6. The architecture according to claim 5, wherein the DMA access engines are adapted to prevent the extended instruction pipeline from executing additional instructions when the DMA instruction queue is full and a new DMA instruction is blocked from entering the buffer, until a current DMA operation is completed allowing the blocked DMA operation to enter the buffer from the instruction pipeline.
 7. The architecture according to claim 6, wherein the DMA access engine is adapted to cause the extended instruction pipeline to resume execution once the DMA instruction trying to enter the DMA buffer that was blocked previously enters the DMA instruction buffer when the current DMA operation is completed.
 8. In a microprocessor having a main instruction pipeline and processor extension logic comprising an extended instruction pipeline that is coupled to the main instruction pipeline via an instruction queue, wherein the extended instruction pipeline is adapted to be selectively decoupled from the main instruction pipeline to perform autonomous operation, and where the extended instruction pipeline is further coupled to DMA engines for moving data into and moving data out of a local memory, a method for maximizing simultaneous operation of the extended instruction pipeline and the DMA engines comprising: executing an instruction from the extended instruction pipeline requiring the DMA engine; buffering the instruction if sufficient buffer space is available in the DMA engine instruction queue; and preventing the extended instruction pipeline from further execution if insufficient queue space is available until a current DMA operation is complete, freeing up a space in the queue to accept a blocked DMA instruction on the instruction pipeline, thereafter resuming execution of the extended processor pipeline. 